On the interpretation of transcriptome-wide association studies

Transcriptome-wide association studies (TWAS) aim to detect relationships between gene expression and a phenotype, and are commonly used for secondary analysis of genome-wide association study (GWAS) results. Results from TWAS analyses are often interpreted as indicating a genetic relationship between gene expression and a phenotype, but this interpretation is not consistent with the null hypothesis that is evaluated in the traditional TWAS framework. In this study we provide a mathematical outline of this TWAS framework, and elucidate what interpretations are warranted given the null hypothesis it actually tests. We then use both simulations and real data analysis to assess the implications of misinterpreting TWAS results as indicative of a genetic relationship between gene expression and the phenotype. Our simulation results show considerably inflated type 1 error rates for TWAS when interpreted this way, with 41% of significant TWAS associations detected in the real data analysis found to have insufficient statistical evidence to infer such a relationship. This demonstrates that in current implementations, TWAS cannot reliably be used to investigate genetic relationships between gene expression and a phenotype, but that local genetic correlation analysis can serve as a potential alternative.


Introduction
Transcriptome-wide association studies (TWAS) [1][2][3][4][5][6][7][8] constitute a statistical framework commonly used to study relationships between gene expression and a phenotype.Combining expression quantitative trait loci (eQTL) and genome-wide association study (GWAS) results, a TWAS analysis estimates the genetic component of the gene expression and then tests whether this is associated with the phenotype.Since TWAS does not require that gene expression and the phenotype are measured in the same sample, it can be used with any phenotype for which GWAS data is available, giving it a broad scope of application.
From a significant TWAS association, it would be tempting to conclude that there is a genetic relationship between the expression of the tested gene and the phenotype of interest, and this interpretation is indeed commonly found in applied TWAS studies [9][10][11][12][13].Such an interpretation is also suggested by various TWAS method papers such as Gamazon (2015) [7] and Mancuso (2019) [8], with for example the latter stating that "[TWAS] can be viewed as a test for non-zero local genetic correlation between expression and trait." However, the traditional TWAS framework does not directly test the relationship of the phenotype with the genetic component of the gene expression.Rather, it only tests the relationship with the predicted genetic component of the gene expression, without accounting for the uncertainty in that prediction [6,7].It thereby omits the fact that the gene expression data is also the result of a sampling process from the analysis.Consequently, as will be shown, the test of association between that predicted genetic component and a phenotype reduces to merely a (weighted) test of joint association of the SNPs with the phenotype, which means that they cannot be used to infer a genetic relationship between gene expression and the phenotype on a population level.
In this study, we show that this distinction has important implications for the interpretation and utility of TWAS results.First, we dissect the mathematical structure of the TWAS framework, explaining the exact null hypothesis that is tested in TWAS analysis, and then demonstrating how this relates to the genetic relationship between gene expression and the phenotype.Second, we use large-scale simulations to demonstrate that if we were to incorrectly interpret TWAS as testing such genetic relationships, this can result in potentially highly inflated type 1 error rates.Finally, we use a real data application with five well-powered phenotypes to illustrate what these error rate inflations can look like in practice, showing that TWAS cannot be used as a reliable tool for detecting genetic relationships between gene expression and a phenotype of interest.

The TWAS framework
The traditional TWAS framework is implemented as a two-stage procedure, as follows [1,6,7].First, a model is specified for the expression level E of a particular gene where X denotes the genotype matrix of SNPs local to that gene (typically within one megabase from the transcription region of the gene), and ξ E a residual component independent of X.Note that any trans-effects on E of SNPs elsewhere on the genome will be absorbed into ξ E , insofar as those SNPs are independent of X.We can further define G E = Xα E , which reflects the true local genetic component of the gene expression captured by the SNPs in X.This model is fitted to a sample with genotype and gene expression data to obtain an estimated genetic effect vector âE .With this, the estimated genetic component of the gene expression is then computed in the GWAS sample for the outcome phenotype Y as ĜE ¼ Xâ E .Finally, a linear regression model of the form with coefficient β and residual ε Y , is used to test the relationship between the estimated genetic component ĜE and the phenotype Y, evaluating the null hypothesis of no association H 0 : β = 0. Note that our presentation here is simplified for the sake of brevity, which includes assuming a continuous phenotype and omitting covariates.The second stage can also be rewritten to require only pre-existing GWAS summary statistics rather than raw genotype and phenotype data as input.See Methods -Outline of TWAS framework for a description of the more general case.
In practice, most existing TWAS methods have this mathematical structure [6][7][8][14][15][16][17][18][19][20][21][22][23][24], including all the 'linear' models listed in Table 1 aside from CoMM [25].Where they differ is in their implementation, particularly in how âE is estimated [6][7][8][14][15][16][17][18][19][20][21][22][23][24][25].Since the number of SNPs in X is usually much larger than the sample size of the gene expression data, and are also highly collinear, it is not possible to fit the model in (1) using standard procedures such as Ordinary Least Squares.Most TWAS implementations therefore use a penalized regression approach such as Elastic Net, LASSO or a spike-and-slab model to resolve this issue (see Table 1).Although different kinds of penalized regression procedures will also yield differ different estimates âE , all these methods are still conceptually equivalent, each providing an estimate ĜE of the genetic component of the gene expression for use in the second stage.
Not all existing methods adopt this structure, such as the methods listed as 'non-linear' in Table 1, but these methods run into the same issues that will be discussed below for the traditional 'linear' TWAS framework (see Non-linear TWAS models in S1 Text for more details).Throughout this paper, unless otherwise specified, we will use TWAS to refer specifically to the linear TWAS framework as described above.

Structure of the TWAS null hypothesis
TWAS evaluates the null hypothesis H 0 : β = 0 for the model presented in Eq (2), which can be rewritten to mathematically equivalent formulations to provide a clearer understanding of what is being tested.Firstly, we can observe that b ¼ covð ĜE ;YÞ varð ĜE Þ , and therefore β will be zero if and only if the covariance covð ĜE ; YÞ is zero.As such, the TWAS null hypothesis is equivalent to H 0 : covð ĜE ; YÞ ¼ 0.
To better understand the nature of the covariance term covð ĜE ; YÞ, we can partition the phenotype Y in the same way as we did with E in Eq (1), such that Analogous to G E , this gives us a local genetic component G Y for the phenotype that is based on the SNPs in X, and a residual ξ Y that is independent of X (that is, cov(X j , ξ Y ) = 0 for every SNP j).
where the last step follows from the distributive property of covariance).The covariance covð ĜE ; YÞ can thus be partitioned into two parts, of which the second part covð ĜE ; x Y Þ must be zero.This follows from the fact that ĜE ¼ Xâ E ¼ P j X j âEj (with j indexing the individual SNPs), and therefore As noted in the previous paragraph, cov (X j , ξ Y ) = 0 for every SNP j, and as such covð ĜE ; x Y Þ ¼ 0 as well.With that, we can conclude that covð ĜE ; YÞ ¼ covð ĜE ; G Y Þ.In other words, since G Y captures all the association that the SNPs in X have with the phenotype Y, and ĜE is entirely based on X, the covariance between ĜE and the residual term ξ Y will be zero by definition.The result of this is that we can further rewrite the TWAS null hypothesis into

Relation to local genetic covariance
Local genetic correlation, analogous to genome-wide genetic correlations, quantifies the genetic relationship between two phenotypes within a particular genomic region, with gene expression, in this context, being one of those phenotypes [26,27].Since the local genetic correlation can be defined as covðG E ;G Y Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi For ease of notation, we will denote Ĉ ¼ covð DE ; G Y Þ.We can thus see that the covariance term covð ĜE ; G Y Þ that is tested by TWAS is offset from the local genetic covariance cov(G E , G Y ) by Ĉ, and plugging this relationship into H 0 : covð ĜE ; G Y Þ ¼ 0, we find that the TWAS null hypothesis is equivalent to Crucially, the regression model in Eq (2) conditions on its predictor ĜE , thus considering the value of ĜE to be fixed.It follows that the values of the deviation term DE , and therefore that of Ĉ, are fixed as well, even though DE is the result of the random sampling of E in the gene expression data.As such, the quantity À Ĉ that TWAS implicitly tests cov(G E , G Y ) against, will have an unknown, arbitrary value (different for each gene) that reflects the specific error in the estimation that was realized when the gene expression E was generated, with the distribution from which this value is drawn also determined by choice of model used for estimating α E (for more details, see The TWAS null value in S1 Text).
To further illustrate this, consider an analogy to the t-test.Suppose we are interested in testing whether the population means μ A and μ B of some variable for groups A and B differ.However, instead of performing a two-sample t-test with H 0 :μ A = μ B , or equivalently H 0 :μ A = μ B = 0, we compute the sample mean mB for group B and use a single sample t-test with H 0 : m A ¼ mB .Let us assume that in this case, in our sample from group B the realized error in the estimate mB is 1, ie. mB ¼ m B þ 1.This means that with our single sample t-test, we would be testing H 0 :μ A = μ B = 1.As with TWAS, because we are disregarding the estimation uncertainty in mB when performing our test, we end up testing the quantity of interest, μ A = μ B , against an unknown null value, which isn't very informative.

Interpretation of TWAS results
As shown above, TWAS implicitly tests the local genetic covariance cov(G E , G Y ) against À Ĉ (for an unknown value of Ĉ) rather than zero, and as such it cannot be considered a direct test of the genetic relationship between gene expression and the phenotype.This raises the question of how TWAS results actually can be interpreted, which we can address by determining the circumstances in which the TWAS null hypothesis As shown above, the null hypothesis Examining the structure of Ĉ more closely we can see that it decomposes into a weighted sum Ĉ ¼ P j ŵj covðX j ; YÞ of the (population-level) covariances cov(X j , Y) of the SNPs in X with the phenotype (see The TWAS null value in S1 Text).These weights ŵj reflect the error in the effect size estimate for each SNP j, ie.âEj ¼ a Ej þ ŵj , and as such their values are the realization of a random process.
It follows that, if there is any genetic association between the SNPs in X and the phenotype Y, the TWAS null hypothesis will only be true if by chance the values of ŵj are such that P j ŵj covðX j ; YÞ happens to equal −cov(G E , G Y ).Because the value of P j ŵj covðX j ; YÞ is effectively a random draw from a continuous space, the probability of this happening is infinitessimal.Realistically, the TWAS null hypothesis will therefore only be true if there is no genetic association between X and Y, with cov(X j , Y) = 0 for every SNP j.In this case, P j ŵj covðX j ; YÞ will be zero as well, and at the same time the absence of such genetic signal also means that the variance of G Y is zero, and therefore no covariance with G E can exist.As such, in this case both cov(G E , G Y ) and Ĉ will be zero, and In practice, the TWAS null hypothesis will therefore be true if, and only if, the SNPs in X are all independent of the phenotype, and a test of this null hypothesis therefore amounts to a joint genetic association test, ie. a test of independence between X and Y akin to testing H 0 :α Y = 0 in the general regression model in Eq (3), though with somewhat different power characteristics.Although as shown, the TWAS null hypothesis H 0 : covðG E ; G Y Þ ¼ À Ĉ being true does imply that the null hypothesis H 0 :cov(G E , G Y ) = 0 of no local genetic relationship is true as well, the reverse does not hold.If genetic association does exist between X and Y, then H 0 : covðG E ; G Y Þ ¼ À Ĉ will be false, but as long as this genetic association is independent of the genetic signal captured by G E , then cov(G E , G Y ) will still be zero.This is why it does not follow from the TWAS null hypothesis being false that a local genetic relationship exists between gene the expression and the phenotype.
Despite this, the intuition may remain that even though ĜE is uncertain, it is still based on the true G E and should therefore still contain some information relevant to the genetic relationship between gene expression and the phenotype.And this is indeed the case, because as shown the tested covariance covð ĜE ; But although present, this information cannot meaningfully be extracted from the TWAS results, since even when covð ĜE ; G Y Þ is found to deviate significantly from zero, the TWAS cannot tell us whether this is due to a strong local genetic relationship through cov(G E , G Y ), or just through the noise term Ĉ.

TWAS as a test of genetic relationship
As established above, TWAS does not technically provide a direct test of the genetic relationship between gene expression and the phenotype, since its null hypothesis is equivalent to with the value of Ĉ unknown.However, it is possible that in the scenarios that are likely to arise in practice, the value of Ĉ might be so small as to be negligible.If so, we could simply treat TWAS as if it is testing the null hypothesis H 0 :cov(G E , G Y ) = 0 after all, and interpret its significant results as indicative of genetic relationships between gene expression and the phenotype without issue.
In the remainder of this paper, we will evaluate the viability in practice of treating TWAS as a test of the null hypothesis H 0 :cov(G E , G Y ) = 0, first using large-scale simulations to determine the resulting type 1 error rates across a range of different scenarios, and then using an application to real data to illustrate the impact in practice.We used FUSION [6] for this evaluation, as it is a widely used tool, and it implements a range of different models within the same tool that are representative of those used in many other TWAS methods (see Table 1).We also included CoMM [25], since it uses a setup that is somewhat different than that of the other TWAS methods, and closer to that of local genetic correlation methods [26,27] (see Comparison with the CoMM model in S1 Text).Note that reported type 1 error rates will be relative to H 0 :cov (G E , G Y ) = 0, rather than the null hypotheses these methods are designed to test, as the aim is to evaluate the viability of using these methods when they are treated as a test the genetic relationship between gene expression and the phenotype (as they are often interpreted in practice).
To serve as a reference, we included the local genetic correlation model implemented in LAVA (denoted LAVA-rG), which is designed and calibrated to test the null hypothesis H 0 : cov(G E , G Y ) = 0 directly.Furthermore, we also implemented a TWAS model in the LAVA framework (denoted LAVA-TWAS), which is identical to LAVA-rG in every way, except that like other traditional TWAS methods it treats ĜE as given (rather than modelling its distribution under sampling of the gene expression E, like LAVA-rG does; see Methods -LAVA implementation of TWAS).This implementation allows for a direct comparison between testing

Simulation study
To evaluate the performance of TWAS when interpreted as a test of the genetic relationship between gene expression and outcome phenotype, we performed a series of simulations, generating gene expression and phenotype values under the null hypothesis H 0 :cov(G E , G Y ) = 0 and computed type 1 error rates.Genotype data from the UK Biobank [28,29] was used as a basis for the simulations, and we varied the number of SNPs per simulated gene, as well as the sample size and local heritability of the gene expression and outcome phenotype data (see Methods -Simulation study).Analysis was performed using LAVA-TWAS and LAVA-rG, as the four different models in FUSION (BLUP, Bayesian LMM, Elastic Net, LASSO), and CoMM.
The simulations showed considerable inflation of type 1 error rates (α = 0.05) when using LAVA-TWAS to evaluate H 0 :cov(G E , G Y ) = 0, depending on the simulation condition.The inflation strongly increased with larger local heritability or sample size for the outcome phenotype (Fig 1 ), with the local heritability also interacting with the number of SNPs in the gene (Fig A in S1 Text).In addition, some decrease in the type 1 error rate inflation was also found with greater local heritability or sample size for the gene expression, but this effect is much less pronounced.This is in part because in the typical sample size range of existing individual eQTL data sets, the relative contribution of these parameters to the overall variance is dominated by that of other parameters (see also the attenuation factor A discussed in Underlying distributions in S1 Text).At higher sample sizes for the gene expression, changes in both the sample size and the heritability of the gene expression would have a larger impact.
The simulations also show that the type 1 error rate inflation also became progressively worse when evaluating it at stricter significance thresholds than the α of 0.05 used so far (Fig 2).Type 1 error rates also increased when filtering out genes that exhibit insufficient genetic signal for the gene expression, as is general practice in TWAS analysis (Fig B in S1 Text).To validate the LAVA-TWAS model, type 1 error rates for LAVA-TWAS under its actual null hypothesis of H 0 : covð ĜE ; G Y Þ ¼ 0 were also assessed.These were found to be well controlled in all conditions (Fig C in S1 Text), further emphasizing the disconnect between this null hypothesis and the null hypothesis of no genetic relationship.
Results for the four FUSION penalized regression models are largely the same as for LAVA-TWAS (Fig 3), suggesting that the type of model and penalization used for the estimation of âE does not strongly affect the subsequent type 1 error rate inflation.Although the Elastic Net and LASSO models did show lower inflation than other models in the 1% gene expression heritability conditions, this can be explained by the fact that they frequently failed to converge in these conditions when estimating Eq (1).This difference was no longer present in the 5% heritability conditions (see  Overall, the results of our simulations demonstrate that treating TWAS analysis as if they were a test of the null hypothesis of no genetic relationship H 0 :cov(G E , G Y ) = 0 can cause considerable inflation of the type 1 error rates in many conditions.A strong effect was observed for the sample size and heritability of the outcome phenotypes, in line with the interpretation of TWAS as a joint association test.Of particular concern is that the inflation increases when filtering on the univariate signal of the gene expression, which is commonly done in TWAS  studies.The inflation also increases when using lower significance thresholds, which is likely to apply in practice to correct for multiple testing, and often at more stringent levels than shown in our simulations.See Discussion of simulation results in S1 Text for more discussion of the findings from these simulations.

Real data analysis
To gauge the magnitude of the impact this issue has in a real data context, we applied TWAS to GWAS results of five well-powered phenotypes [30][31][32][33] (see Table 2), with eQTL data for blood gene expression from GTEx [34] (v8).A total of 14,584 genes were analysed, performing TWAS analysis for those genes for which the univariate gene expression signal was significant at α BONF = 0.05/14,584 = 3.43×10 −6 (see Methods -Data and Methods-Real data analysis).The data was analysed with the Elastic Net and LASSO models in FUSION and LAVA-TWAS.We analyzed the data with LAVA-rG as well to provide a baseline for the LAVA-TWAS results, and determine for how many associations the H 0 :cov(G E , G Y ) = 0 null hypothesis can be validly rejected.
A summary of the results is given in Table 2 and as shown, the number of significant LAVA-TWAS associations is considerably higher than those for LAVA-rG, with on average only about 59% of LAVA-TWAS associations confirmed by LAVA-rG.From this it follows that if we were to interpret LAVA-TWAS as a test of genetic relationship between gene expression and the phenotype, 41% of the associations found would not be valid.This is because LAVA-rG is calibrated for testing H 0 :cov(G E , G Y ) = 0, and LAVA-TWAS only differs from LAVA-rG in the fact that it conditions on ĜE whereas LAVA-rG models the distribution of the estimation uncertainty in ĜE that we know exists.This single difference is the cause of the inflated type 1 error rates observed for LAVA-TWAS (but not LAVA-rG) in the simulations.Consequently, because there is no other difference between the two, the 41% of significant LAVA-TWAS associations not shared with LAVA-rG cannot be due to a legitimate increase in power, meaning that they are not valid if interpreting LAVA-TWAS as a test of H 0 :cov(G E , G Y ) = 0 (see also Underlying distributions in S1 Text).
Although no equivalent reference model directly testing H 0 :cov(G E , G Y ) = 0 is available for the FUSION models, the FUSION results paint a similar picture.The number of associations found when using FUSION are at a comparable level to LAVA-TWAS (Tables 2 and A in S1 Text), and the FUSION models had very similar type 1 error rate inflation levels as LAVA-TWAS in the simulations as well.Taken together, it can reasonably be concluded that the proportion of FUSION results in the real data analysis that are not supported by sufficient statistical evidence, when interpreted as a test of the genetic relationship between gene expression and the phenotype, is likely at a very similar level as LAVA-TWAS.We also evaluated results for genes with a univariate p-value for the gene expression greater than 0.05.As shown in Table B in S1 Text, virtually no local genetic correlations are found with LAVA-rG for these genes, yet results for the TWAS analyses remain at a similar level as in the primary analysis in Table 2.Because these genes have no detectable genetic component for the gene expression, they would typically not be included in a TWAS analysis.But the fact that TWAS still yields a considerable number of significant associations in this context, when âE reflects essentially just noise, serves to further illustrate how TWAS simply functions as a form of joint association test, with results not necessarily reflecting any genetic relationship between gene expression and the outcome phenotype.

Deflation of standard errors
Fundamentally, the statistical reason behind the results in our simulations and real data analysis is that the TWAS model omits a source of variance that we know is present.The estimated genetic effect vector âE is subject to estimation uncertainty due to the sampling of the gene expression E, but this uncertainty is not accounted for in the TWAS test.If the aim is to evaluate the genetic relationship with the phenotype, by testing the null hypothesis H 0 :cov(G E , G Y ) = 0, the resulting standard errors will therefore only reflect the uncertainty due to sampling of Y, and not that of E. As such, the standard errors will be too small, creating a downward bias in the p-values (see Underlying distributions in S1 Text for more details).
For LAVA-TWAS the corrected standard errors are known, as these are the LAVA-rG standard errors, and we can therefore use these to illustrate the issue.We computed this standard error deflation individually for each gene in our analysis, as the ratio of the LAVA-TWAS standard errors to their corrected value.These are summarized in Table 3 and, as shown, they vary considerably across genes, with negligible deflation for some genes but severe deflation for others.

Discussion
In this paper, our aim was to elucidate the interpretation of the TWAS null hypothesis and results.Although TWAS results are often interpreted as indicating a genetic relationship between gene expression and the phenotype, we showed that this is not what the TWAS null hypothesis allows us to test.Instead, the null hypothesis that the traditional TWAS models actually evaluate effectively makes them a test of joint genetic association between the SNPs included in the analysis and the phenotype, irrespective of any genetic relationship with the gene expression.However, because it is not uncommon for TWAS results to be misinterpreted as a test of such a genetic relationship, we aimed to determine to what extent this interpretation was viable in practice.To this end, we performed simulations to determine the degree of type 1 error rate inflation that occurs when treating TWAS results as a test of genetic relationship, and performed real data analysis to demonstrate what this translated to in practice.Our results showed that using TWAS to infer genetic relationships between gene expression and the phenotype can lead to strongly inflated type 1 error rates, as well as a substantial proportion of invalid significant results, for which such a genetic relationship is insufficiently supported by the statistical evidence in the data.This was further emphasized by the fact that in our real data analysis, TWAS still yielded a considerable number of significant associations even when genetic signal for the gene expression was absent altogether, clearly illustrating how TWAS functions simply as a form of joint association test.The issue is further complicated by the fact that the degree to which standard errors are deflated varies across genes, depending on factors such as the gene size and level of genetic signal for both gene expression and phenotype.3).As shown, the relative bias is stronger the lower the p-value.For example, at the median standard error deflation of 0.89, at a reference p-value of 0.05 the biased p-value is about 0.028, about a 1.8-fold decrease in the p-value relative to its true value.At a reference p-value of 0.0001 however, the biased p-value is about 1.2×10 −5 , an 8.1-fold decrease.In the right panel is the general relationship between the true reference p-value and the p-value that would result at different levels of standard error deflation, based on the percentiles for the BMI analysis.This further illustrates the relative increase in bias for lower p-values, with the gap with the reference p-value (black line) getting larger at lower reference pvalues.Note that all axes with p-values use a -log10(p-value) scaling, but labels are expressed as p-values for ease of interpretation.This therefore rules out the possibility of a simple post-hoc correction on existing TWAS results, making them difficult to interpret and use in practice.Although TWAS results can still be validly interpreted as a form of joint association test for the genetic relationship between the SNPs and the phenotype, as out results show, it does not allow for any conclusions to be drawn about the relationship of the phenotype with the gene expression, and therefore does not address the research questions that most researchers are looking to answer with such data.
While the issue of uncertainty in eQTL estimates has occasionally been mentioned in TWAS literature [1,25,35,36], its implications for the validity of the TWAS framework have received little scrutiny thus far.Moreover, although various TWAS methods have included simulations to evaluate type 1 error rates, these all either treated ĜE as fixed in the simulations just as in the TWAS model itself [16,23], or only simulated under a null scenario of the phenotype having no genetic component at all (ie.G Y = 0) [22,24,25].This explains why the issue demonstrated in this paper has not yet been widely discussed.
Addressing this issue with TWAS generally requires fully accounting for the uncertainty of the estimate ĜE in relation to the true genetic component G E .While it is in principle possible to incorporate this into existing TWAS models, in practice, the feasibility of implementing this will strongly depend on the approach used to fit the model in Eq (1), as the distribution of the noise component induced by penalized regression models can take mathematically very complex forms.
Another option is to use alternative methods to detect genetic relationships between gene expression and a phenotype.As demonstrated, local genetic correlation analysis methods could potentially fill this role, as these are explicitly designed to evaluate local genetic relationships between phenotypes.However, methods like colocalization [37,38] or Mendelian Randomization (MR) [39][40][41] could potentially be used for this as well.These methods use quite different model assumptions, but test null hypotheses that can answer similar research questions pertaining to local genetic relationships between gene expression and phenotypes.Of note however is that the issue with TWAS described in this paper has a clear analogy to the issue of the "no measurement error" (NOME) assumption in MR analysis [42].While the practice of selecting only highly significant instruments could reduce the impact of the NOME assumption, further evaluation would be needed to determine whether this same issue would still arise when using MR as an alternative to TWAS.
Investigating genetic relationships between gene expression and phenotypes, as TWAS is often intended to do, has the potential to provide valuable insights into the genetic etiology of many phenotypes.But although traditional TWAS models are statistically valid relative to their own null hypothesis, ultimately they only perform a test of joint association of the phenotype with the SNPs in a region, and thus cannot be used to reliably draw conclusions about any genetic relationship with gene expression.Further development of existing TWAS methods to account for this issue, or adoption of alternative methods such as local genetic correlation analysis, will therefore be essential to support future investigations of the relationship between gene expression and phenotypes.

Ethics statement
This study relied on simulated data based on the UK Biobank genotype data [28], as well as secondary analysis of publicly available summary statistics.Ethical approval for this data was obtained by the primary researchers, and as such no further ethical approval was required for this study.

Outline of TWAS framework
For a standardized genotype matrix X containing SNPs for a particular gene, and with E and Y the centered gene expression for that gene and outcome phenotype respectively, the general TWAS framework consists of a two-stage procedure, based on the Eqs (1) and ( 2) in the main text, with the estimated genetic component ĜE ¼ Xâ E , also given in the main text.For ease of notation, we have omitted model intercepts and covariates from these equations, but in practice these will usually be included.An alternative model may also be used rather than the linear regression in Eq (2), such as a logistic regression if the phenotype is binary.
For each gene and tissue, Eq (1) is first fitted to the eQTL data to obtain the estimated weight vector âE .This is then used to compute ĜE in the target GWAS sample in the second stage, and plugged into the linear model in Eq (2).A p-value is then obtained by performing a test on the coefficient β.Note that usually the second stage is only performed for genes and tissues that exhibit sufficient genetic association in the eQTL data.This second stage can also be rewritten in terms of GWAS summary statistics, allowing TWAS to be performed without having direct access to the GWAS sample.In this case a genotype matrix X obtained from a separate reference sample is used to estimate LD.
Which SNPs are included in X varies, but a common choice is to use all available SNPs within one megabase of the transcription region of the gene.Although for simplicity the same genotype matrix X is used for Eqs (1) and ( 2), there will be separate X genotype matrices for each sample.The analysis is therefore restricted to using only those SNPs that are available in both samples, as well as in the LD reference sample when using summary statistics as input.
In practice, Eq (1) cannot be fitted with a traditional multiple linear regression model, due to the high LD between SNPs (leading to extreme collinearity), and the number of SNPs typically exceeding the sample sizes of eQTL data.Some form of regularization in the regression model is therefore required to obtain âE , and consequently one of the main discrepancies between TWAS implementations is the specific regularization used (see Table 1).In some cases, rather than fitting Eq (1), the elements of âE are simply set to the marginal SNP effect estimates instead.
Note that some methods [22][23][24] diverge from this linear model structure (Table 1, non-linear models).Statistically these can be seen as generalizations of the TWAS framework, though conceptually they can no longer be interpreted as imputing the genetic component of gene expression.See Non-linear TWAS models in S1 Text for more details.

Local genetic correlation
The LAVA implementation of local genetic correlation analysis has been described in detail in Werme et al. (2022) [26].In brief, LAVA uses summary statistics and a reference genotype sample to fit Eqs ( 1) and ( 3), obtaining estimates of âE and âY as well as a corresponding sampling covariance matrix for each (a logistic regression equivalent is used for binary phenotypes).To do so, a singular value decomposition for X is computed, pruning away excess principal components to attain regularization of the models and allowing them to be fitted.
With the pruned and standardized principal component matrix W = XR (with R the transformation matrix projecting the genotypes onto the principal components), we can write G E = Wδ E and G Y = Wδ Y , where δ E and δ Y are the genetic effect size vectors for these principal components.Their estimates dE and dY can be used to obtain âE and âY by reversing the transformation through R, such that âE ¼ R dE and âY ¼ R dY , with the projection to the principal components effectively providing a form of regularization in the estimation of α E and α Y .In practice however, LAVA is defined and implemented directly in terms of δ E and δ Y and its estimates, rather than working with α E and α Y explicitly.
For ease of notation, we define the combined matrix G = (G E , G Y ) = Wδ for the genetic components, with combined effect size matrix δ = (δ E , δ Y ).We denote the effect sizes for a single principal component j as δ j , corrresponding to the jth row of γ, and denote the number of principal components as K.
The estimates dE and dY are obtained by reconstructing multiple linear regressions from the input summary statistics (see Werme et al. (2022) [26] for details).This uses two separate equations of the form E = Wδ E +z E and Y = Wδ Y +z Y , analogous to Eqs (1) and ( 3) but regressing on W rather than X, with residual variances Z 2 E and Z 2 Y for z E and z Y .From these models we have estimates of the form dE ¼ ðW E and ŝ2 Y and the off-diagonal elements are 0 (in the general case the off-diagonal elements represent the sampling covariance resulting from sample overlap, but this is not present in the analyses in this study).Since for the covariance matrix of G, denoted O, we have , it follows that inference on O can be performed using the sampling distributions for dE and dY directly.Using this model, separate univariate tests of joint association of the SNPs in X with E and Y can be performed, testing the null hypotheses d E ¼ !0 and d Y ¼ !0 respectively (using standard linear regression F-test for continuous phenotypes (such as gene expression), or a χ 2 test for binary phenotypes).This is equivalent to testing the local genetic variances o 2 E and o 2 Y , the diagonal elements of O.In general, it is recommended in LAVA to test both genetic variances before performing the bivariate analysis, since genetic covariance can only exist in a genomic region where there both phenotypes exhibit some degree of genetic variance, though analogous to standard practice in TWAS literature in this paper we only test on the genetic variance for the gene expression.
From the above distributions it follows that the expected value E½ dT d� ¼ d T d þ K Ŝ, and we can therefore use the method of moments to estimate O as Ô ¼ dT d À K Ŝ.Since in the present analyses there is assumed to be no sample overlap, the off-diagonal elements of S are 0, and the estimate for the genetic covariance therefore reduces to ôEY ¼ dT E dY .The matrix dT d has a non-central Wishart sampling distribution, which in the current implementation of LAVA is used to obtain p-values to test H 0 :ω EY = cov(G E , G Y ) = 0 using a simulation procedure (see Werme et al. (2022) for details).Note that the LAVA model as described in Werme et al. ( 2022) is technically defined in terms of test- However, under this null hypothesis ôEY has an expected value of zero and a variance of Although it is a sum of product-normal distributions, with large enough K its distribution converges on a normal distribution following the central limit theorem, allowing for a normal approximation of the sampling distribution of ôEY .For the simulations and real data analysis we computed LAVA-rG p-values in both ways, and as shown in Fig G in S1 Text these were found to be virtually identical.As such, all LAVA-rG results reported in this paper are based on p-values computed using the distribution ôEY � Nð0; V rG Þ, to provide a mathematically more direct comparison for the LAVA-TWAS implementation.

LAVA implementation of TWAS
To construct a TWAS model within the LAVA framework, we note that in a linear regression for Eq (2) we have estimator b ¼ ĉ ovð ĜE ;YÞ v arð ĜE Þ .Since v arð ĜE Þ is considered fixed in TWAS the sampling distribution of b directly proportional to the distribution of ĉ ovð ĜE ; YÞ, the sample estimate of covð ĜE ; YÞ.As both ĜE and Y have means of zero, and since ĜE ¼ W dE , we have , and it therefore follows that ĉ ovð ĜE ; YÞ ¼ dT E dY .This ĉ ovð ĜE ; YÞ is therefore equal to the genetic covariance estimate ôEY used by LAVA-rG, and thus both LAVA-TWAS and LAVA-rG are shown to use the same test statistic dT E dY ¼ ôEY .We can also observe that this ôEY can be rewritten as the sum P j dEj dYj , weighting each genetic association dYj by the corresponding eQTL estimate dEj .This implementation is therefore analogous to how TWAS is performed using GWAS summary statistics in other TWAS methods (eg.Gusev (2016) [6]), except defined in terms of the estimated genetic associations of principal component matrix W rather than the original SNP genotype matrix X.

Data
Genotype data for simulations came from the UK Biobank [28], a national genomic study of over 500,000 volunteer participants in the UK.The study was approved by the National Research Ethics Service Committee North West-Haydock (reference 11/NW/0382) and data were accessed under application #16406.Data collection, primary quality control, and imputation of the genotype data were performed by the UK Biobank [29].Data was converted to hardcalled genotypes, and then additionally filtered, using only SNPs with an info score of at least 0.9 and missingness no greater than 5%, and restricting the sample to unrelated, Europeanancestry individuals with concordant sex.Remaining missing genotype values were then mean-imputed to ensure consistency of the input data across different methods.
The European panel of the 1,000 Genomes [43] data (N = 503, as downloaded from https:// ctg.cncr.nl/software/magma)was used as genotype reference data to estimate LD for the real data analysis for both LAVA and FUSION, removing SNP with a minor allele frequency (MAF) below 0.5%.For eQTL data we used the GTEx [34] data (v8, European subset), for whole blood gene expression.The published cis-eQTL summary statistics (European ancestry) from GTEx Portal were used for the LAVA analysis, pre-computed FUSION model files were used for the FUSION analysis.For every analysed gene, we included all SNPs in the data within one megabase of the transcription start site.Genes were filtered to include only autosomal protein-coding and RNA genes expressed in blood, and only genes also available in the FUSION model files were used, resulting in a total of 14,584 genes available for analysis.GWAS summary statistics were selected for five well-powered phenotypes, chosen to reflect a range of different domains.These were BMI (GIANT) [30] (no waist-hip ratio adjustment), educational attainment (SSGAC) [31], schizophrenia (PGC, wave 3) [32], diastolic blood pressure (GWAS Atlas) [33] and type 2 diabetes (GWAS Atlas) [33].Sample size and number of SNPs for each sample can be found in Table 2.

Simulation study
Simulation studies were performed to evaluate the type 1 error rates of TWAS when used to evaluate the null hypothesis H 0 :cov(G E , G Y ) = 0.The simulations were based on the UK Biobank data, generating gene expression and outcome phenotypes while varying their sample sizes and the local heritabilities, as well as the number of SNPs per gene.The number of SNPs K in the genes was varied across values of 100, 500, 1,000 and 2,500 SNPs; sample sizes of either 250 or 1,000 were used for the gene expression (denoted N E ), and 10,000, 50,000 or 100,000 for the outcome phenotype (denoted N Y ); and local heritabilities, defined as the percentage of variance explained by the SNPs in the gene, were set at either 1%, 5% or 10% for the gene expression (denoted h 2 E ) [44], and 0.001%, 0.005%, 0.01%, 0.05%, 0.1%, 0.5% or 1% for the outcome phenotype (denoted h 2 Y ).The number of SNPs K was only varied at sample size settings of N E = 1,000 and N Y = 10,000, and the sample sizes were only varied at K = 1,000.All combinations of the two local heritabilities were simulated for each setting of K and sample sizes.The simulations were analysed using LAVA-TWAS, LAVA-rG, the four FUSION models and CoMM.
To create the samples to use, we first drew a random subset of 100,000 individuals from the UK Biobank data.For each sample size N used in the simulations, the first N individuals from this subset were used for that sample, to ensure that samples of different sizes are all nested.The MAF for all SNPs was computed based on the N = 100 subsample, discarding all SNPs with MAF below 5%.Since the subsamples are nested, this ensures that all SNPs have a sufficient minor allele count at every sample size used.Ten blocks of 2,500 consecutive SNPs were then selected to represent simulated genes, one block each from around the middle of the first ten chromosomes (avoiding the centromere region).The number of SNPs K was varied across conditions by using only a subset of these 2,500 SNPs, always selecting the first K SNPs from each block.For each simulation condition, the total number of iterations for that condition was evenly distributed over these ten blocks, and results were aggregated over the blocks.Type 1 error rates per condition were computed as the proportion of iterations for that condition for which the p-value was smaller than the significance threshold, which was set to 0.05 unless otherwise specified.
To simulate gene expression E and outcome phenotype Y for the standardized genotypes X of the SNPs in a gene, X was projected onto standardized principal components W, pruning away redundant components based on the cumulative genotypic variance explained by the components (retaining those that jointly explain 99% of the total variance).For each iteration, true genetic effect sizes δ E and δ Y for the principal components under the null hypothesis of cov(G E , G Y ) = 0 were generated by drawing values from a normal distribution, then regressing one vector on the other and retaining only the residuals for the outcome vector to ensure that δ E and δ Y were exactly independent.The two genetic components were then computed as G E = Wδ E and G Y = Wδ Y , and both were subsequently standardized to a variance of one.
Simulated gene expression and phenotype values were then generated as E = Wδ E +ξ E and Y = Wδ Y +ξ Y , drawing the residuals ξ E and ξ Y from normal distributions with variance parameters equal to respectively, ensuring the desired level of explained variance.Both E and Y were then regressed on each individual SNP in X using simple linear regression, and the resulting test statistics as well as the simulated E and Y themselves were stored for subsequent analysis.
To compare the different methods, a baseline set of simulation conditions with 1,000 iterations, with K = 1,000, N E = 1,000 and the remaining parameters across their full range, was generated and analysed with all methods: LAVA-TWAS, LAVA-rG, FUSION-BLUP, FUSION-BSLMM, FUSION-ElastNet, FUSION-LASSO and CoMM.LAVA analyses were performed using the summary statistics as input, FUSION analyses with the raw gene expression values for the first stage and outcome phenotype summary statistics for input, and CoMM using the raw values for both gene expression and outcome phenotype.For the first stage of FUSION, cross-validation was turned off and the true R 2 E ¼ h 2 E 100 value was set using the -hsq_set option.In the CoMM analyses, the model was found to run into convergence issues due to K being equal to N E , and the simulations were rerun setting K to 100 instead to resolve this.Moreover, an additional set of simulations was performed with h 2 e values of 1%, 2%, 4%, 5%, 6%, 8% and 10%, and h 2 Y values of 0.01%, 0.02%, 0.05%, 0.1%, 0.2%, 0.5% and 1%.For LAVA-TWAS, further of simulations with 10,000 iterations were performed across the full range of parameter settings for analysis, to evaluate the impact of these parameters on type 1 error rates.To determine the effect of filtering genes on the presence of eQTL signal, as is common practice in TWAS analyses, for these simulations we also computed the LAVA univariate p-value for the gene expression.In addition to the regular type 1 error rates, filtered type 1 error rates were also computed based on only the iterations for which the univariate pvalue was below either 0.05 or 0.0001.Finally, to allow for reliable computation of type 1 error rates at lower significance thresholds, for the subset of these conditions with K = 1,000, N E = 1,000, h 2 E ¼ 5% and h 2 Y ¼ 0:001%; 0:01%; 0:1% or 1% the number of iterations was increased to 100,000.Note that although for LAVA-TWAS the number of iterations differs from the other methods and varies across conditions, for each plot in this paper the type 1 error rate values within that plot are always based on the same number of iterations.
Additional simulations were also performed under the TWAS null hypothesis of H 0 : covð ĜE ; G Y Þ ¼ 0 to validate the LAVA-TWAS model.To do so, the above procedure was modified to first generate δ E and the corresponding G E , from which E was then simulated.This was used to estimate SNP summary statistics for the associations between X and E, from which the estimate vector dE was then computed for the outcome data.Finally, δ Y was then generated to be exactly independent of this dE , with the rest of the simulation process proceeding as normal.For h 2 Y , the parameter range was extended to also include h 2 Y ¼ 0%.

Real data analysis
TWAS and local genetic correlation analyses were performed on the GWAS data as follows, separately for each of the five phenotypes.In all the analyses, SNP filtering was applied to remove all SNPs with a minor allele frequency below 0.5%, and only SNPs available in the 1,000 Genomes data, the GTEx data, and the GWAS sample for that phenotype were used.For every gene, univariate LAVA analysis for the eQTL signal, LAVA-rG and LAVA-TWAS analyses, and FUSION analyses with the Elastic Net and LASSO models was performed.For FUSION, the BLUP and BSLMM models were not used as these were not available in the precomputed model files.FUSION univariate p-values for the eQTL signal were also obtained from these model files.Note that separate SNP filtering was applied prior to these models being computed, and as such the LAVA and FUSION analyses are based on different sets of SNPs per gene.
For the primary analyses, the bivariate LAVA and FUSION analyses were only performed for genes for which the LAVA univariate tests were significant after correcting for the total number of genes, at a threshold of 0.05/14,584 = 3.43×10 −6 (see Table A in S1 Text for results when using the FUSION univariate p-values instead).The significance threshold for the bivariate analyses was set separately for each phenotype, at a Bonferroni-correction for the total number of genes that was univariate significant for that phenotype (see Table 2).In a secondary analysis, to evaluate TWAS results in the absence of gene expression signal, bivariate analyses were performed for all genes for which the LAVA univariate p-value was greater than 0.05, again setting the significance threshold per phenotype for the number of genes with univariate p-value greater than 0.05 for that phenotype.

additional results. Fig A. Type 1 error rate simulation results as a function of the number of
SNPs in the gene.Shown is the type 1 error rate (at significance threshold of 0.05) of the LAVA-TWAS model for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis).Results are shown for sample sizes of 1,000 and 10,000 for the expression and outcome respectively, at 5% local heritability for the expression.As shown, the number of SNPs interacts with the local heritability of the outcome, with lower numbers of SNPs resulting in a lower type 1 error rate at lower heritability, but increasing more rapidly as a function of heritability than conditions with higher numbers of SNPs.Fig B .Type 1 error rate simulation results as a function of univariate input filtering.Shown is the type 1 error rate (at significance threshold of 0.05) of the LAVA-TWAS model for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis) and different gene filtering settings.The univariate filtering is based on the test of genetic signal for the gene expression implemented in LAVA, applying either no filtering or subsetting to iterations of the simulation with univariate p-value smaller than 0.05, 0.001 or 0.0001 (corresponding to multiple testing correction for 1, 50 or 500 genes respectively) before computing the type 1 error rate on the remaining iterations.Results are shown for a sample size (N) of 1,000 and a local heritability of 5% for the expression, and 1,000 SNPs.As shown, filtering at a univariate p-value threshold of 0.05 has little effect, but at the stricter thresholds the type 1 error rate becomes further increased.As shown, the results are very similar across the different models.The Elastic Net and LASSO models do show decreased type 1 error rate for the 1% expression heritability condition, which is due to these models more frequently failing to converge if there is little detectable genetic signal for the expression, making the TWAS analysis inherently non-significant in those cases (whereas the BLUP and Bayesian LMM models always yield an âE ).  for using 100 rather than 1,000 SNPs.As shown, type 1 error rate inflation is generally more pronounced than for the other methods.Moreover, there is a (partial) reversal of the effect of the local heritability of the expression, with the error rates for the 10% heritability level generally in between those of 1% and 5% heritability.Subsequently, additional simulations were performed using a more fine-grained range of heritability values to further investigate this phenomenon, with results shown on the bottom row with local heritability for the gene expression now on the horizontal axis and for the outcome as separate lines.These additional results confirm the non-monotonic effect of gene expression heritability on type 1 error rate, and demonstrate an interaction between it and the heritability and sample size of the outcome as well.See also section Discussion of simulation results above.Fig F. P-value bias in the real data analysis for BMI.Shown is the relation between the -log10 p-values of LAVA-TWAS model against those of the LAVA-rG model, demonstrating the bias in the LAVA-TWAS p-values as a result of its failure to account for the uncertainty in ĜE .On the left a scatterplot of all the -log10 p-values from the analysis, on the right the same values truncated to 1.0×10 −10 to better visualize the region around the Bonferroni-corrected significance threshold (horizontal and vertical dashed lines).In green the genes significant for both LAVA-TWAS and LAVA-rG, in red the invalid associations significant in LAVA-TWAS only.Fig G .Comparison of -log10 p-values for different implementations of LAVA-rG.P-values in the published implementation of LAVA-rG are computed using a partially empirical, sampling-based approach.Shown here is a comparison with p-values computed using a normal approximation, results for which are used throughout this paper rather than those based on the empirical p-values.Here a comparison is given of all the LAVA-rG p-values from the simulations (left) and real-data analysis (right).As shown, these p-values are virtually identical, and as such results based on either approach can be considered interchangeable.error rate (at significance threshold of 0.05) of the four penalized regression models in FUSION for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), at different levels of local heritability (h2) for the outcome phenotype (horizontal axis).Results are shown for 5% heritability for the gene expression, and sample sizes of 1,000 and 100,000 for the expression and outcome respectively, with 1,000 SNPs.Significance is declared either when the permutation p-value is below 0.05 (black lines), or when the permutation p-value and the main analysis p-value are both below 0.05 (yellow lines).As shown, type 1 error rates are generally well controlled for the Elastic Net and LASSO models, and well controlled for the Bayesian LMM model when using both p-values, though for the BLUP model inflation remains even when using both p-values.
unconfounded by other differences in model or implementation.
Fig D in S1 Text).Similar patterns of type 1 error rate inflation were found for CoMM as well and were generally more pronounced, though the influence of the gene expression heritability was more complex than for the other methods (Fig E in S1 Text).This is due to specific constraints in the model, which are analogous to how traditional TWAS models constrain α Y to be proportional to âE in the multiple regression model in Eq (3) (see Comparison with the CoMM model in S1 Text).As in the validation simulations in Werme (2022) [26], type 1 error rates for LAVA-rG were found to be well-controlled (Fig 3).

Fig 1 .
Fig 1. Type 1 error rate simulation results as a function of sample size and local heritability.Shown is the type 1 error rate (at significance threshold of 0.05) of the LAVA-TWAS model for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis) and gene expression (separate lines).Results are shown for different sample sizes (N) for the expression (rows) and outcome (columns) data, and at 1,000 SNPs.As shown, the type 1 error rates get considerably larger with increasing sample size or local heritability for the outcome.Conversely, increasing these for the expression reduced the type 1 error rate, but the effect of this is much less pronounced.https://doi.org/10.1371/journal.pgen.1010921.g001

Fig 2 .
Fig 2. Type 1 error rate inflation simulations results as a function of significance threshold.Shown is the type 1 error rate inflation of the LAVA-TWAS model for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), for different levels of significance threshold α.The type 1 error rate inflation is defined as the type 1 error rate divided by the significance rate α, and equals 1 if the error rates are well-controlled.Results are shown for simulations with expression sample size of 1,000, expression local heritability of 5%, and 1,000 SNPs.As shown, the type 1 error rate becomes relatively more inflated the lower the significance threshold used (separate lines), with the difference becoming more pronounced at higher local heritability (h2) (horizontal axis) and sample size (N) (separate panels) for the outcome.https://doi.org/10.1371/journal.pgen.1010921.g002

Fig 3 .
Fig 3. Comparison of type 1 error rate simulation results across models.Shown is the type 1 error rate (at significance threshold of 0.05) of the five different TWAS models and LAVA-rG for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis) and gene expression (separate panels).Results are shown for sample sizes of 1,000 and 100,000 for the expression and outcome respectively, with 1,000 SNPs.As shown, the results are very similar across the different models.The lower type 1 error rate for Elastic Net and LASSO compared to the other TWAS models for the 1% expression heritability condition is due to these two models sometimes failing to converge at very low genetic signal for the expression (see also Fig D in S1 Text).https://doi.org/10.1371/journal.pgen.1010921.g003 Fig 4 illustrates  the relation between this standard error deflation and the resulting bias in p-values, using the BMI results as an example, with Fig F in S1 Text further showing the level of realized bias.This high variability of standard error deflation across genes also demonstrates that this issue must be addressed at the level of the individual gene, and cannot be resolved by applying a simple post-hoc correction to the gene p-values.

Fig 4 .
Fig 4. Illustration of p-value bias as a result of deflated standard errors.Shown in the left panel are the p-values that will result from a LAVA-TWAS analysis at different levels of standard error deflation, when the true p-value computed with correct standard errors would either be 0.05 (left vertical axis) or 0.0001 (right vertical axis).The standard error deflation values are taken from the BMI analysis, with the large red point denoting the median and the other red points the 5 th , 25 th , 75 th and 95 th quantiles of their distribution (see Table3).As shown, the relative bias is stronger the lower the p-value.For example, at the median standard error deflation of 0.89, at a reference p-value of 0.05 the biased p-value is about 0.028, about a 1.8-fold decrease in the p-value relative to its true value.At a reference p-value of 0.0001 however, the biased p-value is about 1.2×10 −5 , an 8.1-fold decrease.In the right panel is the general relationship between the true reference p-value and the p-value that would result at different levels of standard error deflation, based on the percentiles for the BMI analysis.This further illustrates the relative increase in bias for lower p-values, with the gap with the reference p-value (black line) getting larger at lower reference pvalues.Note that all axes with p-values use a -log10(p-value) scaling, but labels are expressed as p-values for ease of interpretation. https://doi.org/10.1371/journal.pgen.1010921.g004 Fig C. Type 1 error rate simulation results for LAVA-TWAS under the TWAS null hypothesis.Shown is the type 1 error rate (at significance threshold of 0.05) of LAVA-TWAS under the TWAS null hypothesis covð ĜE ; G Y Þ ¼ 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis) and gene expression (separate lines).Results are shown for sample sizes of 1,000 and 100,000 for the expression and outcome respectively, with 1,000 SNPs.As shown, and in contrast to the results when testing the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0, Fig 1), type 1 error rates here are well controlled regardless of condition.Fig D. Type 1 error rate simulation results for FUSION models.Shown is the type 1 error rate (at significance threshold of 0.05) of the four penalized regression models in FUSION for the null hypothesis of no genetic covariance (cov (G E , G Y ) = 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis) and gene expression (separate lines).Results are shown for sample sizes of 1,000 and 100,000 for the expression and outcome respectively, with 1,000 SNPs, thus corresponding to the results for LAVA-TWAS shown in the bottom-right panel of Fig 1.

Fig E. Type 1
error rate simulation results for CoMM.Shown on the top row is the type 1 error rate (at significance threshold of 0.05) of CoMM for the null hypothesis of no genetic covariance (cov(G E , G Y ) = 0), at different levels of local heritability (h2) for outcome phenotype (horizontal axis) and gene expression (separate lines).Results are shown for sample sizes of 1,000 for the expression, and sample sizes of 10,000 (left) and 100,000 (right) for the outcome.Results in the top-right panel thus correspond to the LAVA-TWAS results in the bottom-right panel of Fig 1 and the FUSION results in Fig D, except Fig H. Type 1 error rate simulation results for FUSION models with post hoc permutation test.Shown is the type 1 blood eQTL summary statistics and the gene definition file used for the LAVA-TWAS and LAVA-rG analyses were obtained via https://www.gtexportal.org/home/datasets, the corresponding FUSION model files from http://gusevlab.org/projects/fusion/#single-tissue-gene-expression (EUR Samples).The European subset of the 1000 Genomes data used as LD reference for the real data analyses was downloaded from https://ctg.cncr.nl/software/magma.UK Biobank genotype data was obtained from https://www.ukbiobank.ac.uk/ under application #16406.LAVA is implemented as an R package, which is publicly available at the LAVA website (https://ctg.cncr.nl/software/lava) and LAVA GitHub repository (https:// github.com/josefin-werme/LAVA).GTEx v8 input files compatible with LAVA are available from the LAVA website as well.Version 0.1.0 of LAVA was used for the LAVA analyses in this paper; this version contains specific functionality for handling eQTL summary statistics input files.FUSION can

Table 1 . Overview of available TWAS analysis methods. Method Weight estimation a Base model extensions
LASSO: least absolute shrinkage and selection operator; BLUP: best linear unbiased predictor; LMM: linear mixed model a Multiple entries for a method denote different options; 'marginal' refers to marginal SNP effect sizes being used as weights, 'external' means the method requires precomputed weights from an external source b The name 'FUSION' and the LASSO and elastic net options for this method were added after publication of the Gusev (2016) paper https://doi.org/10.1371/journal.pgen.1010921.t001 and thus will be zero if and only if cov(G E , G Y ) is zero, the genetic relationship between gene expression and the phenotype can be tested using the null hypothesisH 0 :cov(G E , G Y ) = 0.However, by contrast, TWAS performs a test of covð ĜE ; G Y Þ rather than cov(G E , G Y ), as shown above.To see how these two covariances relate to each other, we can partition the esti-mated genetic component ĜE as ĜE ¼ G E þ DE ,with DE the deviation of the sample estimate from the true genetic component G E .Again using the distributive property of covariance, we can then conclude covð ĜE

Table 2 . Summary of results of TWAS and local genetic correlation analyses of published summary statistics for five phenotypes. Phenotype Sample size a Number of SNPs b Genes tested Significance threshold LAVA signif. associations FUSION signif. associations r G TWAS Overlap % c Elast. net LASSO
Genes were included for bivariate testing if they exhibited significant univariate eQTL signal in the LAVA univariate test at α BONF = 0.05/14,584 = 3.43×10 −6 (see Table A in S1 Text for results filtered on FUSION univariate test).Results were Bonferroni corrected per phenotype for the number genes tested (see Methods -Real data analysis).Note that all significant LAVA-rG associations were also significant in LAVA-TWAS.
a Showing case/control for binary phenotypes b After filtering for overlap with 1,000 Genomes and GTEx SNPs c Percentage of significant LAVA-TWAS associations that were also LAVA-rG associations https://doi.org/10.1371/journal.pgen.1010921.t002

Table 3 . Summary of TWAS standard error deflation values per gene.
Based on genes with significant univariate eQTL signal in the LAVA univariate test, see Table2.The standard error deflation is calculated as the ratio of the standard error for genes in the LAVA-TWAS model (not correcting for uncertainty in the estimate) to the corresponding standard errors in the LAVA-rG model (calibrated for testing H 0 :cov(G E , G Y ) = 0).https://doi.org/10.1371/journal.pgen.1010921.t003 1), with I K the size K identity matrix) and similarly dY ¼ W T Y NÀ 1 , with corresponding sampling distributions dE � MVNðd E ; s 2 E IÞ and dY � MVNðd Y ; s 2 Y IÞ, where s 2 E ¼ .For principal component j we therefore have dj � MVNðd j ; ŜÞ, where the diagonal elements of Ŝ are ŝ2 Y

Table A .
Summary of results of TWAS and local genetic correlation analyses of published summary statistics for five phenotypes, with FUSION univariate filtering.

Table B .
Summary of results of TWAS and local genetic correlation analyses for genes with no detectable eQTL signal.Table C. Summary of results of FUSION permutation tests for published summary statistics for five phenotypes.(DOCX)